This is an R Markdown document designed to review Seurat and
basic analysis of single-cell RNA-seq (scRNA-seq) data. For this tutorial we
will be using a few samples from GSE235326, from Kumar T,
Nee K, Wei R, He S et al. A spatially resolved single-cell genomic atlas
of the adult human breast. Nature 2023 Aug;620(7972):181-191. The
samples selected from this data set were generated with 10x Genomics Chromium
Single Cell 3' v3.1 chemistry.
This tutorial will review the following items in a standard analysis using the Seurat package: 1. Create Seurat objects * single sample * a list of samples
The following are covered in the second RMD file:
Coursera courses you might consider for R coding
training (although not for single-cell analysis specifically): 1. Applied Data
Science with R Specialization - focuses more on applied coding.
General Written Resources
YouTube tutorials 1. for basic R understanding and tutorials consider these: * https://michaelgastner.com/R_for_QR/index.html (there are paired free YouTube videos)
If you want to know more about a function, you can use
?FUNCTION_NAME to examine its inputs, where
FUNCTION_NAME is the name of the function with the correct
capitalization.
If you are working interactively, you can browse to the data location and set a new working directory - in RStudio, use the Files side panel to walk through your home directory and set a new directory.
We will be using the following libraries. These should be available
if you are using the server. However, you might need to install them.
Note this tutorial was prepared using
R version 4.2.0 (2022-04-22) and Seurat 4, but should work
fine with version 5.
If you need to install them:
Seurat - install.packages('Seurat')
ggplot2 - install.packages("ggplot2")
devtools - install.packages("devtools")
patchwork - devtools::install_github("thomasp85/patchwork")
dplyr - install.packages("dplyr")
readxl - install.packages("readxl")
library(Seurat)
## Attaching SeuratObject
## Attaching sp
library(ggplot2)
library(patchwork)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readxl)
This will need to be changed to the specific location where you have placed the data. You should specify where you are working and set the directory.
Functions for this section:
<- Base R function - this is the assignment operator,
which is fairly R specific. You can also use =, which is a
more common assignment syntax in other languages.
setwd Base R function - this sets your working directory
to the string you provide, assuming you have permissions to that location
and have provided a valid path.
Note the slash direction is due to this being run on
either a Mac or Linux. If you are using a Windows machine, your
slashes will be \ instead; a path-building function can
handle this for you if you are unsure.
Note Ideally, you would not need to set your working directory. Your files should be together in a folder. To this end, I suggest a structure of functions, reports, and data under each project folder. Example:
data
functions
figures
output
reports
Your structure will likely vary by project.
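As a sketch, the folder structure above can be created from R with base functions; file.path() also builds paths portably, which sidesteps the slash-direction question. The "my_project" path here is just a hypothetical example:

```r
# Hypothetical example project root; adjust to your own location
my_project <- "my_project"

# Create each suggested sub-folder (recursive = TRUE also creates the root)
for (sub_dir in c("data", "functions", "figures", "output", "reports")) {
  dir.create(file.path(my_project, sub_dir),
             recursive = TRUE, showWarnings = FALSE)
}
```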
your_working_dir <- "/Users/akcasasent/Desktop/Teaching/GSE235326/"
setwd(your_working_dir)
For this tutorial, make sure that wherever you have set your working
directory contains a folder called data, and
within that folder you have downloaded the 3 or more h5 files from the
HBCA samples from GEO: GSE235326 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE235326,
or for more complete items from https://explore.data.humancellatlas.org/projects/1ffa2223-28a6-4133-a5a4-badd00faf4bc.
You can read in data with Read10X_h5, or from the more
basic output with Read10X, which takes as input a
directory with three files: a matrix.mtx file, a
barcodes.tsv file, and a genes.tsv file.
Note h5 is used because you can provide just one file with the same information in a sparse matrix format. This is the format these files were uploaded as into GEO which makes them the easiest to get in the case of this project.
Functions for this section:
list.files base R function - lists the files in a directory as a
vector
paste0 base R function - concatenates strings
together; the default separator is nothing, aka 0 spaces.
print base R function - prints output to the
screen
Read10X_h5 Seurat - reads in an h5
matrix and transforms it to a sparse matrix.
# read in a single file
data_file_list <- list.files("data")
data_file_list
## [1] "GSM7500359_hbca_c50_raw_feature_bc_matrix.h5"
## [2] "GSM7500360_hbca_c51_raw_feature_bc_matrix.h5"
## [3] "GSM7500362_hbca_c53_raw_feature_bc_matrix.h5"
## [4] "GSM7500363_hbca_c54_raw_feature_bc_matrix.h5"
## [5] "GSM7500495_hbca_c44_raw_feature_bc_matrix.h5"
## [6] "GSM7500496_hbca_c45_raw_feature_bc_matrix.h5"
## [7] "GSM7500499_hbca_c48_raw_feature_bc_matrix.h5"
## [8] "GSM7500500_hbca_c49_raw_feature_bc_matrix.h5"
## [9] "GSM7500501_hbca_c18_raw_feature_bc_matrix.h5"
## [10] "GSM7500512_hbca_c165_raw_feature_bc_matrix.h5"
## [11] "GSM7500513_hbca_c154_raw_feature_bc_matrix.h5"
# note this grabs just the first file
my_file_name <- paste0("data/",data_file_list[1])
print(my_file_name)
## [1] "data/GSM7500359_hbca_c50_raw_feature_bc_matrix.h5"
hbca.data <- Read10X_h5(filename = my_file_name)
For this example we are using
GSM7500359_hbca_c50_raw_feature_bc_matrix.h5. However, you
can change it by changing the index number here. Notice I am currently showing a
partially hard-coded method so that later I can show you how to remove the
hard-coding when you run this in a loop, while still letting you see what each of
the items is or should be.
I usually suggest being consistent with how you assign
variables, but here I just wanted to show that = and
<- both work to assign items.
Note We call these features, not genes, because we use a very similar form for other items, such as when you have antibodies or other information also held in these locations.
min_cells = 3
min_features <- 200
Here we choose how much of the raw data we want to work with by setting specific limits.
Here we state that we require a gene to be present in at least
3 cells and each cell is required to have at least
200 features or genes.
Functions:
paste base R - also concatenates strings, but its
standard sep is a single space. You can also use
collapse to combine items across a vector.
strsplit base R - splits strings based on a separator;
here I used underscores. Note that because it returns a
list, I extract the character vector with [[1]].
: this is a handy operator for sequential items.
1:4 creates the vector 1,2,3,4. In this case we select the first
list element [[1]] and then the second and third items [2:3]
CreateSeuratObject this is a
SeuratObject function. It creates a Seurat object from
the raw data. During this step you can also set min cells and min
features/genes, which will remove items below the thresholds. For
completely raw data you would set both min cells and min features to
zero.
Note The min cells and min features settings are distinct. For individual samples, it is often good to leave the minimums at zero, because otherwise you can filter out genes that vary across different samples; later, you can set the threshold for the whole data set at the same time. Further, you often want to determine the min features threshold empirically. For example, immune-cell sets often require a lower threshold, similar to the cutoff used for frozen nuclei samples. Fresh tumor cells, on the other hand, often have more genes expressed and might warrant a higher threshold.
For min.cells, a good starting place is a minimum
between 3-5 cells. However, sometimes it makes sense not to include this
threshold at all for a single sample and only apply it when you have
multiple samples.
For min.features (fresh single cell), a good starting
place for fresh samples is between 100 and 200. Lean towards the lower bound
for immune-heavy samples where you have selected for immune cells in
your experiment. If you have very viable, and therefore very good, samples,
setting this at the higher end makes sense. This number will also
be impacted by sequencing coverage and depth. I would usually start at
200 for these.
For min.features (frozen or single nuclei), a good
starting place for frozen or nuclei samples is a range between
50 and 150. Lean towards the lower bound for immune-heavy samples where you
have selected for immune cells in your experiment. I would usually start
at 100 for these.
Alternative method of thresholding Others choose their ranges using the standard deviation or median absolute deviation. However, you need to be careful with these: they usually do not help so much for setting the lower bound of the threshold as the upper.
# get the sample name by pulling it from the first second and third part of the string when split by underscores.
sample_name <- paste(strsplit(my_file_name, split = "_")[[1]][2:3], collapse = "_")
# creates an object, using the filters from the previous sections.
hbca <- CreateSeuratObject(counts = hbca.data, project = sample_name, min.cells = min_cells, min.features = min_features)
# let you look at the overview of the seurat object.
hbca
## An object of class Seurat
## 22719 features across 6446 samples within 1 assay
## Active assay: RNA (22719 features, 0 variable features)
Below we add the percentage of a specific feature. In this object we
are assigning the seurat object a meta feature called
percent.mt for percent of mitochondrial genes.
Functions:
PercentageFeatureSet - This Seurat-specific function
allows you to use regular expression patterns to select features (genes)
and calculate the percentage of those genes within the gene expression
of your samples. For R regular expressions, good sources are https://r4ds.had.co.nz/strings.html and https://www.datacamp.com/tutorial/regex-r-regular-expressions-guide.
^ at the beginning of the regular expression, this
notes that what follows must be the first characters.
RP[SL] In this expression, “RP” matches genes that
encode ribosomal proteins, and “S” or “L” specifies the subunit of the
ribosome
[[:digit:]] matches any digit
| this is an OR operator
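As a quick illustration of these patterns, base R's grepl() reports which of a few hypothetical gene names (made up here for illustration, not taken from this data set) each pattern matches:

```r
# Hypothetical gene names for illustration only
example_genes <- c("MT-CO1", "RPS6", "RPL13A", "RPLP0", "RPSA", "ACTB")

# "^MT-" matches only names that start with "MT-"
grepl("^MT-", example_genes)
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE

# the combined ribosomal pattern used below
grepl("^RP[SL][[:digit:]]|^RPLP[[:digit:]]|^RPSA", example_genes)
# [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
```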
Here is the example with mitochondrial gene percentage, which is often used to assess sample quality: a high fraction indicates apoptotic or lysing cells. High, in this case, is considered bad.
hbca[["percent.mt"]] <- PercentageFeatureSet(hbca, pattern = "^MT-")
Ribosomal protein genes are used in the filtering of single-cell RNA sequencing data to assess the quality of the data and remove poor-quality cells. The proportion of ribosomal protein reads can be an indicator of RNA degradation, and high ribosomal protein content can be associated with specific cell types, such as immune cells.
However, they can also be regressed out depending on what you are analyzing. Note that in this case we have multiple patterns (genes matching “RP[SL]” plus a digit, “RPLP” plus a digit, or “RPSA”).
hbca[["percent.rb"]] <- PercentageFeatureSet(hbca, pattern = "^RP[SL][[:digit:]]|^RPLP[[:digit:]]|^RPSA")
This produces a ratio related to how saturated the UMIs are per gene.
hbca[["umi_gene_ratio"]] <- log10(hbca@meta.data$nCount_RNA/hbca@meta.data$nFeature_RNA)
Add the Mitochondrial and Ribosomal genes percentage together.
hbca[["mt_rb"]] <- hbca@meta.data$percent.mt + hbca@meta.data$percent.rb
Examine the meta data where QC or cells specific information could be stored.
Functions:
head function from the utils package (a fairly basic R
package). This returns the first part of a data frame, table, matrix, or
vector.
tail function that starts from the bottom instead of
the top.
Specifically, these functions can be useful to examine the meta data in the Seurat object and look at top 5 items.
If you want to look at the whole thing you can use
View, but it can crash R if the object is too large.
head(hbca@meta.data, 5)
## orig.ident nCount_RNA nFeature_RNA percent.mt percent.rb
## AAACCCAGTAGCGTAG-1 hbca_c50 2017 878 0.594943 33.36639
## AAACCCAGTTGAGGAC-1 hbca_c50 4664 1164 10.484563 30.18868
## AAACCCATCCATCTAT-1 hbca_c50 531 363 3.013183 26.93032
## AAACCCATCGCGTGAC-1 hbca_c50 272 206 5.147059 22.42647
## AAACCCATCTGCCTGT-1 hbca_c50 302 209 5.960265 23.84106
## umi_gene_ratio mt_rb
## AAACCCAGTAGCGTAG-1 0.3612114 33.96133
## AAACCCAGTTGAGGAC-1 0.6028056 40.67324
## AAACCCATCCATCTAT-1 0.1651879 29.94350
## AAACCCATCGCGTGAC-1 0.1207017 27.57353
## AAACCCATCTGCCTGT-1 0.1598607 29.80132
It is also good to check the summary for the items before you move on.
summary(hbca@meta.data)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## hbca_c50:6446 Min. : 232 Min. : 200 Min. : 0.0000
## 1st Qu.: 724 1st Qu.: 396 1st Qu.: 0.6446
## Median : 3208 Median : 1288 Median : 2.8137
## Mean : 7924 Mean : 1744 Mean : 5.1775
## 3rd Qu.: 10048 3rd Qu.: 2640 3rd Qu.: 6.1631
## Max. :308372 Max. :11107 Max. :94.5980
## percent.rb umi_gene_ratio mt_rb
## Min. : 0.4127 Min. :0.05469 Min. : 1.16
## 1st Qu.:14.2498 1st Qu.:0.23833 1st Qu.:18.34
## Median :19.1717 Median :0.41628 Median :23.59
## Mean :20.1745 Mean :0.43186 Mean :25.35
## 3rd Qu.:24.8442 3rd Qu.:0.60649 3rd Qu.:30.50
## Max. :71.1422 Max. :1.58865 Max. :95.16
One of the first items you want to do when looking at your data is visualize the quality of the samples.
Visualize QC metrics as a violin plot.
Functions:
VlnPlot - Seurat-specific function which can show a
violin plot of single-cell data. This can be anything from QC metrics,
meta data, and scores to single or combined gene expression values. Note you
can show multiple items and control the layout with ncol. When examining the violin plots, some cutoffs to keep in mind use either standard deviations or median absolute deviations. Alternatively, strict pre-defined cutoffs are possible; however, these are often very arbitrary and likely need to vary by sample.
As discussed previously:
nFeature: represents the number of unique genes detected in each cell
nCount: represents the total number of molecules (UMIs) detected within a cell
umi gene ratio: represents the number of genes detected per UMI (unique molecular identifier) for each cell. A higher UMI-gene ratio indicates that more genes are detected per UMI, which is generally considered a good indicator of data quality; however, it can also reflect duplicates.
percent mt: percent of mitochondrial genes.
percent rb: percent of Ribosomal genes.
mt rb: percent of mitochondrial genes and Ribosomal genes together.
VlnPlot(hbca, features = c("nFeature_RNA", "nCount_RNA", "umi_gene_ratio", "percent.mt", "percent.rb", "mt_rb"), ncol = 3)
Visualize QC metrics as a scatter plot to visualize feature-feature relationships.
Example:
nCounts vs nFeatures (Where an identity line is almost expected)
nCounts vs mito genes (where the relation is less focused)
Functions: 1. FeatureScatter - Seurat function to create
a scatter plot using 2 different features. Here you can examine how
these items are correlated or not. Note this by default reports the
Pearson correlation between the 2 features. In the most
basic form you specify the 2 different features.
plot1 <- FeatureScatter(hbca, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot2 <- FeatureScatter(hbca, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot1 + plot2
nCounts vs ribo genes (where the relation is less focused)
mito genes vs ribo genes
plot3 <- FeatureScatter(hbca, feature1 = "nCount_RNA", feature2 = "percent.rb")
plot4 <- FeatureScatter(hbca, feature1 = "percent.mt", feature2 = "percent.rb")
plot3 + plot4
Notes
Pearson correlation does not always make sense for these types of comparisons. For example, it assumes:
Independence of variables. Example of an issue - percent RB and percent MT expression are dependent, as both are calculated as percentages.
Proportion data - percentages also violate this.
Nonlinear relationships - can still be significant relationships
Pearson correlation is a parametric test - it assumes a normal distribution of data and is used to measure the strength and direction of the linear relationship between 2 variables. However, single-cell data is rarely “normally distributed”.
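A small made-up example shows the difference: with one extreme value (common in skewed single-cell counts), Pearson correlation is pulled away from 1 while the rank-based Spearman correlation is unaffected:

```r
# Toy vectors: y increases monotonically with x, but x has one extreme value
x <- c(1, 2, 3, 4, 100)
y <- c(2, 4, 6, 8, 10)

cor(x, y, method = "pearson")   # sensitive to the outlier
cor(x, y, method = "spearman")  # rank-based: exactly 1 here
```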
Therefore, here we show how to calculate Spearman correlations for the meta variables across all cells.
Functions:
cor - this allows you to correlate two matrices or
data.frames.
[,-1] - this removes the first column. Note that
R's indexing starts with 1, and the negative says to
remove that index.
[row,col] - this lets you access items in a
matrix.
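A minimal base-R illustration of this indexing (toy matrix, not the meta data):

```r
m <- matrix(1:9, nrow = 3)  # 3x3, filled column by column

m[, -1]   # all rows, drop the first column
m[2, 3]   # single element at row 2, column 3 (here: 8)
```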
head(hbca@meta.data,5)
## orig.ident nCount_RNA nFeature_RNA percent.mt percent.rb
## AAACCCAGTAGCGTAG-1 hbca_c50 2017 878 0.594943 33.36639
## AAACCCAGTTGAGGAC-1 hbca_c50 4664 1164 10.484563 30.18868
## AAACCCATCCATCTAT-1 hbca_c50 531 363 3.013183 26.93032
## AAACCCATCGCGTGAC-1 hbca_c50 272 206 5.147059 22.42647
## AAACCCATCTGCCTGT-1 hbca_c50 302 209 5.960265 23.84106
## umi_gene_ratio mt_rb
## AAACCCAGTAGCGTAG-1 0.3612114 33.96133
## AAACCCAGTTGAGGAC-1 0.6028056 40.67324
## AAACCCATCCATCTAT-1 0.1651879 29.94350
## AAACCCATCGCGTGAC-1 0.1207017 27.57353
## AAACCCATCTGCCTGT-1 0.1598607 29.80132
# remove the first column (orig.ident) because it is not numeric
spearman_corr <- cor(x=hbca@meta.data[,-1], method = "spearman")
spearman_corr
## nCount_RNA nFeature_RNA percent.mt percent.rb umi_gene_ratio
## nCount_RNA 1.00000000 0.98546530 -0.2422277 -0.03888511 0.94095420
## nFeature_RNA 0.98546530 1.00000000 -0.2647517 -0.09519991 0.87383896
## percent.mt -0.24222771 -0.26475172 1.0000000 -0.17478919 -0.15392484
## percent.rb -0.03888511 -0.09519991 -0.1747892 1.00000000 0.06030663
## umi_gene_ratio 0.94095420 0.87383896 -0.1539248 0.06030663 1.00000000
## mt_rb -0.19085446 -0.27892687 0.3182172 0.77800426 0.00307772
## mt_rb
## nCount_RNA -0.19085446
## nFeature_RNA -0.27892687
## percent.mt 0.31821724
## percent.rb 0.77800426
## umi_gene_ratio 0.00307772
## mt_rb 1.00000000
Items can also be accessed by name using rownames and colnames.
spearman_corr["nCount_RNA","nFeature_RNA"]
## [1] 0.9854653
Other useful functions: signif - rounds values to the
specified number of significant digits.
sign - returns the sign of each element of a vector as a
positive or negative number (or zero).
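For example:

```r
# Round to 2 significant digits
signif(0.9854653, 2)      # 0.99
# Sign of each element: -1, 0, or 1
sign(c(-0.24, 0, 0.78))   # -1  0  1
```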
This is where we reach more of a grey area with respect to quality control cutoffs. How you set your cutoffs, and the rationale behind setting them, will strongly depend on the project. Outside of hard cutoff values (thresholds), you may also choose to set your cutoffs by standard deviation from the mean (parametric method - not preferred for single-cell data) or median absolute deviation from the median (non-parametric method - slightly preferred for single-cell data).
Common standards that people often put in place include: standard deviation (parametric), median absolute deviation (non-parametric), or a hard number cutoff (threshold) used across samples. The easiest to check is the hard threshold, but these can be troublesome, causing loss of too much data.
For example, consider cancer treatment samples, where you might have huge changes in viability. A strict viability cutoff might therefore not be comparable between pre-treated and post-treated samples.
In the paper for these samples, the following cutoffs were used, so we will use similar values; you can find them in the methods section of the HBCA paper. Using the same cutoffs keeps us from comparing apples to oranges.
min_nFeature = 200
max_nFeature = 2500
min_nCount = 500
max_nCount = 20000
max_mito = 10
max_ribo = 50
However, note that these values may differ because:
in our data sets, we saw that some immune cells tend to have a lower number of expressed genes.
tumor cells often have increased ploidy and therefore an increased number of genes.
Function:
subset - this function allows you to subset or filter
your data based on metrics. Here we use all the cutoffs in a single
line.
& logical operator for AND.
< comparison operator for less than
> comparison operator for greater than
So, the subset function below removes cells that fail
any of these thresholds.
# an unfiltered version to show items
unfiltered_hbca <- hbca
# filter the main object of interest
hbca <- subset(hbca, subset = nFeature_RNA > min_nFeature &
nFeature_RNA < max_nFeature &
nCount_RNA > min_nCount &
nCount_RNA < max_nCount &
percent.mt < max_mito &
percent.rb < max_ribo)
Note you could also have set a cutoff based on
mad and median, or sd and
mean. These often allow more data to be included, but
cause issues when it is not well documented why a cutoff was
selected. Also, if applied per sample, some people consider this bad practice.
Question: How would you select the mad below and above the median for each of these cut offs?
Note that with each of the filters you will lose data. Sometimes people like to examine each of these items so you can see what you are losing as you filter your samples.
Here, we created a function to make drawing the lines on the plots easier. When creating a function, each parameter that you leave unassigned is a required input.
The ones where you provide an assignment become the defaults.
The ... (Ellipsis) allows assignment of underlying
parameters. I generally suggest that when writing a function you allow this
to be passed along as well, because it allows the end user to make more
adjustments if they need to.
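A minimal sketch of how ... forwards extra arguments to an inner function (wrapped_paste is a made-up name for illustration):

```r
# "..." collects any extra arguments and passes them on to paste()
wrapped_paste <- function(..., prefix = "sample") {
  paste(prefix, ..., sep = "_")
}

wrapped_paste("hbca", "c50")
# [1] "sample_hbca_c50"
```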
Note we declare the function, including setting defaults suitable for standard use with min and max cutoffs.
VlnPlot_lines <- function(data, myfeature, vcutoff, vcolor=c("blue", "red"), vplacement=c(1, 0.95), my.pt.size=0, plotfill="grey", ...)
{
vplot_feature <- VlnPlot(data, features = myfeature, pt.size = my.pt.size, ...) +
scale_fill_manual(values = plotfill) +
geom_hline(yintercept = vcutoff, color = vcolor)
for(index in 1:length(vcolor))
{
vplot_feature <- vplot_feature + annotate("text", x = vplacement[index], y = vcutoff[index], label = sum(data[[myfeature]] < vcutoff[index]), vjust = -1, hjust = 0, color = vcolor[index])
}
print(vplot_feature)
}
Here we plot each feature with the number of cells that would be cut off at each of these cutoffs.
VlnPlot_lines(data=unfiltered_hbca, myfeature="nFeature_RNA", vcutoff=c(min_nFeature, max_nFeature), vplacement=c(1.45,.75))
VlnPlot_lines(data=unfiltered_hbca, myfeature="nCount_RNA", vcutoff=c(min_nCount, max_nCount), vplacement=c(1.10,.83))
VlnPlot_lines(data=unfiltered_hbca, myfeature="percent.mt", vcutoff=c(max_mito), vcolor="red", vplacement=.85)
VlnPlot_lines(data=unfiltered_hbca, myfeature="percent.rb", vcutoff=c(max_ribo), vcolor="red", vplacement=.85)
In order, we will run these filters and record the number of cells removed at each step of the filter. Note it is possible for more than one filter to remove the same cell. This means when we add up all the cells removed, some cells may be counted more than once.
Here we will count for each feature how many cells are removed.
Question - How would you determine which filters removed the same cells?
Ways of measuring some of the filters for specific items.
nFeature_RNA > min_nFeature results in
24 being filtered.
nFeature_RNA < max_nFeature results in
1700 being filtered.
nCount_RNA > min_nCount results in
1190 being filtered.
nCount_RNA < max_nCount results in
747 being filtered.
percent.mt < max_mito results in 742
being filtered.
percent.rb < max_ribo results in 75
being filtered.
Total amount filtered is 3495.
Note that some cells might have been counted multiple times in this description. How can you tell if an item would be counted more than once?
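One way to answer this: build one logical vector per filter marking the cells that fail it, then count per cell how many filters flagged it. Here is a sketch with a tiny made-up meta-data table; with the real object you would use unfiltered_hbca@meta.data and all six thresholds:

```r
# Toy meta data: three cells, two QC columns
md <- data.frame(nFeature_RNA = c(150, 3000, 900),
                 percent.mt   = c(12, 5, 2))

# TRUE where a cell FAILS the corresponding filter
fails <- cbind(low_feature  = md$nFeature_RNA <= 200,
               high_feature = md$nFeature_RNA >= 2500,
               high_mito    = md$percent.mt   >= 10)

# Per cell: number of filters that flagged it (0 = kept; 2+ = double counted)
rowSums(fails)
# [1] 2 1 0
```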
VlnPlot(hbca, features = c("nFeature_RNA", "nCount_RNA", "percent.mt", "percent.rb"), ncol = 4)
#show with unfiltered and ABlines
In this section, we demonstrate an alternative filtering method that could be used instead of strict cutoffs. These cutoffs are instead calculated for each sample. However, one issue with this method is that it is harder to double-check what you have filtered. By contrast, the threshold method can simply be checked by printing the max for each cutoff.
To perform this, I need to introduce you to some other base functions used in these calculations.
Functions:
apply this lets you run a function over a vector, array, or
list. This function allows better parallelization than a for loop. If
using 1 you are applying across rows; if using 2, across
columns. You can write your own functions or use functions already in R,
such as those listed here: median, mad, mean, sd, and others.
median calculates the median of a vector -
non-parametric tests
mad calculates the median absolute deviation of a
vector - non-parametric tests
mean calculates the mean of a vector - parametric
tests
sd calculates the standard deviation of a
vector - parametric tests
Note the apply function can be confusing to start with; however, it is very similar to a for loop, where the function is applied for each column or row of the array.
If you need to gain more understanding of for loops: 1. https://www.geeksforgeeks.org/for-loop-in-r/
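To make the correspondence concrete, here is a toy matrix where apply() over columns (MARGIN = 2) gives the same result as an explicit for loop:

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

# apply the median over each column
col_medians <- apply(m, 2, median)

# the equivalent for loop
loop_medians <- numeric(ncol(m))
for (j in 1:ncol(m)) {
  loop_medians[j] <- median(m[, j])
}

col_medians   # 1.5 3.5 5.5
```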
# calculate the MAD for each meta-data column (excluding orig.ident)
hbca_mads <- apply(unfiltered_hbca@meta.data[,-1], 2, function(my_col_vector) {mad(my_col_vector)})
# Example of cut off max and min cuts offs using 3 mads +/- the median (with zero preventing items going outside negative)
#note the 3 is hard coded you might consider using 2 or even 2.5.
hbca_cutoffs <- apply(unfiltered_hbca@meta.data[,-1], 2, function(my_col_vector)
{
#
my_max = median(my_col_vector) + 3*mad(my_col_vector) # added 3
my_min = median(my_col_vector) - 3*mad(my_col_vector) # added 3
if(my_min < 0)
{
print(paste("before changes to zero my min was:", my_min))
my_min <- 0 # set the min to zero if less than zero
}
data.frame(min=my_min, max=my_max)
})
## [1] "before changes to zero my min was: -9154.6601"
## [1] "before changes to zero my min was: -2986.8358"
## [1] "before changes to zero my min was: -7.85113880856232"
## [1] "before changes to zero my min was: -3.89892107113198"
## [1] "before changes to zero my min was: -0.403586045213972"
## [1] "before changes to zero my min was: -2.64581791643574"
hbca_cutoffs
## $nCount_RNA
## min max
## 1 0 15570.66
##
## $nFeature_RNA
## min max
## 1 0 5561.836
##
## $percent.mt
## min max
## 1 0 13.47858
##
## $percent.rb
## min max
## 1 0 42.24229
##
## $umi_gene_ratio
## min max
## 1 0 1.236152
##
## $mt_rb
## min max
## 1 0 49.82016
# filtering based on these calculated cutoffs
hbca_mad_filtered <- subset(unfiltered_hbca, subset = nFeature_RNA > hbca_cutoffs$nFeature_RNA$min &
nFeature_RNA < hbca_cutoffs$nFeature_RNA$max &
nCount_RNA > hbca_cutoffs$nCount_RNA$min &
nCount_RNA < hbca_cutoffs$nCount_RNA$max &
percent.mt < hbca_cutoffs$percent.mt$max &
percent.rb < hbca_cutoffs$percent.rb$max)
Note that these cutoffs are a bit less strict than our previous hard cutoffs on the low end, but actually more strict on the high end. Overall, you might end up losing more cells using these cutoffs, but fewer on the low end, due to the center of the distribution.
For example,
nFeature_RNA > min_nFeature: previous result was 24 being filtered; new result would be 0 being filtered.
nFeature_RNA < max_nFeature: previous result was 1700 being filtered; new result would be 158 being filtered.
nCount_RNA > min_nCount: previous result was 1190 being filtered; new result would be 0 being filtered.
nCount_RNA < max_nCount: previous result was 747 being filtered; new result would be 1091 being filtered.
percent.mt < max_mito: previous result was 742 being filtered; new result would be 476 being filtered.
percent.rb < max_ribo: previous result was 75 being filtered; new result would be 194 being filtered.
The previous total amount filtered was 3495.
For the mad filtering, the total amount filtered would be
3566.
Another reason not to use the mean is the sheer number of zeros you will have in single-cell data. This will result in much lower cutoffs than median-based methods.
Overall, the mad-based thresholds would filter 71 more cells than the previous hard cutoffs.
We will be using the hard cutoffs for the rest of the analysis. However, this section is meant to show you a different way to think about sample-specific cutoffs.
Questions
What are the pros and cons of sample specific cut offs?
What are the pros and cons of hard preset cut offs?
We diverge a little from the paper here by using SCTransform regressing on mito, but it is possible to do this on other features which you feel are changing or most affecting the outcome. This is one of the slower functions and one I usually suggest that you run on a cluster, not on individual laptops. However, most standard laptops and desktops can handle 1 to 5 samples for SCTransform, or around 5K to 50K cells. When you go beyond 50K cells, I suggest you start looking into a cluster, or verify that your computer can handle it. Note, if you do not have multiple processors you will run into issues or the analysis will be very slow.
Seurat provides a fairly good walk through - https://satijalab.org/seurat/articles/sctransform_vignette
It also works to replace the following three commands which were used
in the paper: specifically NormalizeData(),
ScaleData(), and FindVariableFeatures().
Function: 1. SCTransform - There is a whole vignette and
papers specifically on what this does. However, what you should keep in mind
is this: it normalizes, scales, and regresses out specific variables, usually
items thought to relate to quality. In this case we will be
regressing out percent mito.
Specifically, this is used to “correct” the counts. To make them easier to use, log1p(counts) changes the data shape to be closer to a normal distribution.
The scale.data is further centered by calculating Pearson residuals.
This also includes the sctransform variance stabilization, saved in a separate new assay called SCT. Note, this is one of the reasons we can overwrite our object: the result is the previous data with an additional assay slot.
This can be slow, and often people do not look at all features; you can change variable.features.n to examine more items, but by default it only examines the top features.
Questions to Consider
What does the NormalizeData() function do? Hint: the standard is NormalizeData() with a scale factor
of 10000.
What does the ScaleData() function do? Hint: think about what scale.max, center, and scale mean within the function.
What does FindVariableFeatures() do? Hint: think about what selection.method means for finding variable features.
How would you describe SCTransform() in your own words? Hint: think about what regressing out a variable does and means for an analysis.
Why would one want to use the SCTransform() function instead of
just running NormalizeData(), ScaleData(), and
FindVariableFeatures()?
What is a common variable to regress out in the
SCTransform() function?
Does regressing this variable out make biological sense to you or not? (Why?)
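For reference, here is a sketch of the standard three-command workflow that SCTransform() replaces. The parameter values shown are common Seurat defaults, not values taken from the paper, and the lines are left commented out since we do not run them in this tutorial:

```r
# Standard alternative to SCTransform() (common default parameters; not run):
# hbca <- NormalizeData(hbca, normalization.method = "LogNormalize",
#                       scale.factor = 10000)
# hbca <- FindVariableFeatures(hbca, selection.method = "vst",
#                              nfeatures = 2000)
# hbca <- ScaleData(hbca, vars.to.regress = "percent.mt")
```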
hbca <- SCTransform(hbca, vars.to.regress = "percent.mt", do.scale = TRUE, verbose = FALSE)
Here we examine the most highly variable genes identified.
These genes might provide information about the sample or clusters. If you are looking at multiple different items, they might show you batch effects or effects between groups. Housekeeping genes would not be expected here, but cell-type or treatment-specific genes might show up.
top10 <- head(VariableFeatures(hbca), 10)
top10
## [1] "HSPA1A" "HSPA6" "AC108134.2" "CFD" "CXCL8"
## [6] "RMRP" "DCN" "FTL" "APOD" "PLCG2"
Here we plot the genes to examine the different features with labels.
Function:
VariableFeaturePlot this highlights all the variable
features. It is a dot plot that shows residual variance against the
geometric mean of expression.
# plot variable features with and without labels
var_plot <- VariableFeaturePlot(hbca)
labtop10_plot <- LabelPoints(plot = var_plot, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
labtop10_plot
Here we start to run different linear reductions. One of the most standard methods of dimensionality reduction is PCA.
Function:
RunPCA - runs PCA for dimensional reduction.
hbca <- RunPCA(hbca, features = VariableFeatures(object = hbca))
## PC_ 1
## Positive: HLA-E, CD74, ADGRL4, HLA-B, PECAM1, TM4SF1, MCTP1, B2M, HLA-DRA, PCAT19
## ETS2, SRGN, GIMAP7, MYCT1, EMCN, HLA-DPA1, EGFL7, CD93, PTPRB, FLT1
## ACKR1, CDH5, S1PR1, HLA-A, PLVAP, GNG11, HLA-DRB5, GIMAP4, PNP, TM4SF18
## Negative: DCN, CCDC80, COL1A2, LUM, COL6A2, C1S, COL6A1, LRP1, COL1A1, FBLN1
## COL3A1, C1R, MGP, GSN, OGN, MFAP4, C3, COL6A3, SERPINF1, SERPING1
## PLAC9, CTSK, SFRP2, CXCL14, PCOLCE, APOD, CFD, TIMP2, COL14A1, SELENOP
## PC_ 2
## Positive: C11orf96, EIF1, ODC1, EIF4A3, ARID5B, H3F3B, CREM, CYCS, GEM, PTP4A1
## LY6K, NCL, TUBB4B, CEBPB, AXL, SERTAD1, CDKN1A, H2AFZ, NR4A3, HIPK2
## PTTG1, PEG10, ZNF331, FOXC2, VASN, PHLDA2, NFIL3, IER3, PPP1R15A, RAN
## Negative: TXNIP, PECAM1, TGFBR2, EGFL7, PLCG2, CD34, NPDC1, SPARCL1, CD74, EMCN
## PALMD, AQP1, HLA-DRA, GIMAP7, ACKR1, ITM2B, HLA-DPA1, PLVAP, IL33, ZNF277
## LDB2, PTPRB, CYYR1, IFI27, TSPAN7, CLEC14A, LIFR, ADGRL4, ITM2A, NOSTRIN
## PC_ 3
## Positive: SPARCL1, MYL9, CAV1, TAGLN, MCAM, CALD1, EPAS1, TINAGL1, ACTA2, ADIRF
## TPM1, MEF2C, PLCG2, DSTN, MYH11, TPM2, SORBS2, NOTCH3, BCAM, NR2F2
## PPP1R14A, A2M, LMOD1, TSC22D1, TNS1, PLN, GUCY1A1, TIMP3, CRIP2, LPP
## Negative: PTPRC, TRBC2, IL7R, STK4, TRAC, CYBA, CYTIP, CD2, CCL5, LCP1
## CD3E, EVI2B, LAPTM5, CD3D, CLEC2D, MBP, CD53, SPOCK2, FYB1, CD247
## TRBC1, SAMSN1, ALOX5AP, CD52, CXCR4, CST7, PTPN22, EMB, CD7, FTL
## PC_ 4
## Positive: EMP1, ANXA5, TM4SF1, ANXA2, DAB2, EIF1, TFPI, HMOX1, DDX21, ABL2
## DCN, PNP, FOSL1, GJA1, CD55, TXN, FBLN1, CFD, ATP1A1, GPRC5A
## RAN, MAP1LC3B, SERPINE1, MYCT1, ETS2, PCAT19, IL1R1, DPT, FTH1, GSN
## Negative: PTPRC, ACTA2, TAGLN, TRBC2, MYL9, IL7R, CALD1, CD2, CCL5, MYH11
## TRAC, TPM1, CD3E, CYTIP, STK4, LMOD1, PLN, CLEC2D, CARMN, CD3D
## NOTCH3, CYBA, CD247, SPOCK2, TRBC1, MCAM, PPP1R12B, SYTL2, PPP1R14A, DSTN
## PC_ 5
## Positive: GADD45B, H3F3B, ID2, HEXIM1, NR4A1, JUNB, BTG1, JUN, FOS, HES1
## ATF3, NR4A2, DNAJB1, GADD45G, CBX4, DDIT4, SLC38A2, SAMD1, IER3, C11orf96
## FOSB, UBB, INTS6, HSP90AA1, DUSP1, IER2, STMN1, KLF4, RRAD, ADAMTS1
## Negative: KRT17, COL17A1, KRT5, SERPINB5, SPINT2, DSP, KRT7, SFN, PERP, TACSTD2
## KRT14, DST, NRG1, PIK3C2G, RPS26, S100A11, ARHGEF35, LAMB3, ACTN1, FHL2
## PAWR, SAA1, TPM1, ITGA2, AZGP1, MGST1, ENO1, CDH1, OXTR, SYT8
Here we examine and visualize the PCA results by printing them out.
print(hbca[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: HLA-E, CD74, ADGRL4, HLA-B, PECAM1
## Negative: DCN, CCDC80, COL1A2, LUM, COL6A2
## PC_ 2
## Positive: C11orf96, EIF1, ODC1, EIF4A3, ARID5B
## Negative: TXNIP, PECAM1, TGFBR2, EGFL7, PLCG2
## PC_ 3
## Positive: SPARCL1, MYL9, CAV1, TAGLN, MCAM
## Negative: PTPRC, TRBC2, IL7R, STK4, TRAC
## PC_ 4
## Positive: EMP1, ANXA5, TM4SF1, ANXA2, DAB2
## Negative: PTPRC, ACTA2, TAGLN, TRBC2, MYL9
## PC_ 5
## Positive: GADD45B, H3F3B, ID2, HEXIM1, NR4A1
## Negative: KRT17, COL17A1, KRT5, SERPINB5, SPINT2
Here we visualize the genes associated with the first two reduction components.
Function:
VizDimLoadings - examines the top genes associated with the reduction components and their loadings.
VizDimLoadings(hbca, dims = 1:2, reduction = "pca")
This next visualization is the most standard way of looking at the reduction.
Function:
DimPlot - plots the cells on the first two principal components.
DimPlot(hbca, reduction = "pca") + NoLegend()
This next visualization examines the top genes driving each of the first nine principal components. This example uses 500 cells, and with balanced selection you can see the most variable genes at either end of each component.
Function:
DimHeatmap - shows the genes expressed at the two extremes of each principal component.
DimHeatmap(hbca, dims = 1:9, cells = 500, balanced = TRUE)
paper_dim <- 20
One issue that usually occurs is selecting the number of dims to be used. In the paper they used around 20. However, if you examine the curve for a single sample you might see a slower progression.
Function:
ElbowPlot - plots the standard deviation of each principal component. You can change which type of reduction you use too. In theory, you want to select where the curve starts to bend.
ElbowPlot(hbca, ndims = 50)
Here we find the neighbors using the selected number of dims. Since we are following the paper, we use 20 to find the neighbors.
Function:
1. FindNeighbors - finds neighbors based on a graph built with a specific distance metric. It runs through multiple set ups. You can adjust a few items here, such as metric - the standard distance used is euclidean.
hbca <- FindNeighbors(hbca, dims = 1:paper_dim)
## Computing nearest neighbor graph
## Computing SNN
For clustering in the paper they use a specific resolution.
low_res <- 0.1
high_res <- 0.8
paper_res <- 0.2
Find the clusters for the three different resolutions.
Function:
FindClusters - clusters the cells based on the resolution; raising or lowering the resolution adjusts how many clusters are found.
hbca <- FindClusters(hbca, resolution = c(low_res,high_res,paper_res))
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2951
## Number of edges: 88513
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9559
## Number of communities: 6
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2951
## Number of edges: 88513
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8663
## Number of communities: 15
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2951
## Number of edges: 88513
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9373
## Number of communities: 8
## Elapsed time: 0 seconds
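Running FindClusters() with a vector of resolutions stores one metadata column per resolution, named <assay>_snn_res.<value> (here the SCT assay, as used for plotting below). A small sketch with mock metadata shows how you might compare the number of clusters found at each resolution; on the real object you would use hbca@meta.data.

```r
# FindClusters() with several resolutions adds one metadata column per
# resolution, named <assay>_snn_res.<value>. This mock metadata mimics the
# three resolutions used above; with a real object use hbca@meta.data.
meta <- data.frame(
  SCT_snn_res.0.1 = rep(0:5,  length.out = 300),
  SCT_snn_res.0.8 = rep(0:14, length.out = 300),
  SCT_snn_res.0.2 = rep(0:7,  length.out = 300)
)
res_cols <- grep("^SCT_snn_res\\.", names(meta), value = TRUE)
# number of clusters found at each resolution
n_clusters <- sapply(meta[res_cols], function(x) length(unique(x)))
n_clusters
```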
Create the UMAP visualization based on the paper dims.
Note that an important item to keep in mind is that you can change a lot of how cells appear or are displayed in a UMAP using parameters such as spread.
Note that the seed can change the outcome.
Note other parameters some will use to adjust to get the "best UMAP graph" include, but are not limited to, n.neighbors, n.epochs, min.dist, and spread.
You can use arguments like group.by to color or split the plot.
hbca <- RunUMAP(hbca, dims = 1:paper_dim, spread=1)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 16:37:19 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:37:19 Read 2951 rows and found 20 numeric columns
## 16:37:19 Using Annoy for neighbor search, n_neighbors = 30
## 16:37:19 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:37:19 Writing NN index file to temp file /var/folders/3_/lw61drbn68xglctt7ljrsmrcd_ghr5/T//RtmpKAKmSs/file7fc8431d9c88
## 16:37:19 Searching Annoy index using 1 thread, search_k = 3000
## 16:37:20 Annoy recall = 100%
## 16:37:21 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:37:24 Initializing from normalized Laplacian + noise (using irlba)
## 16:37:24 Commencing optimization for 500 epochs, with 115736 positive edges
## 16:37:29 Optimization finished
DimPlot(hbca, reduction = "umap", group.by=paste0("SCT_snn_res.", c(low_res, high_res, paper_res)))
One of the major parts of single cell analysis is the differential expression (DE) between clusters or groups.
In this case we can see an example of what the DE analysis between 2
groups is. The default is examining a single cluster (in this case
looking at #6) against all other clusters.
cluster6.markers_all <- FindMarkers(hbca, ident.1 = 6)
## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
head(cluster6.markers_all, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## KRT5 1.407663e-249 0.7438792 0.531 0.009 2.570956e-245
## PIK3C2G 5.062143e-223 0.5295765 0.441 0.005 9.245498e-219
## COL17A1 9.734612e-211 0.6877218 0.524 0.014 1.777929e-206
## SERPINB5 3.910547e-199 0.4055505 0.364 0.002 7.142223e-195
## KRT17 2.161144e-190 1.2209416 0.748 0.053 3.947113e-186
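A common next step is to keep only significant, strongly enriched markers. The sketch below uses a mock table with the same columns FindMarkers() returns; the KRT5 and KRT17 values echo the output above, while GENEX is a made-up non-significant gene.

```r
# Mock marker table with the same columns FindMarkers() returns; KRT5 and
# KRT17 values echo the output above, GENEX is a made-up filler gene.
markers <- data.frame(
  p_val      = c(1.41e-249, 1e-3, 2.16e-190),
  avg_log2FC = c(0.744, 1.5, 1.221),
  pct.1      = c(0.531, 0.20, 0.748),
  pct.2      = c(0.009, 0.18, 0.053),
  p_val_adj  = c(2.57e-245, 1, 3.95e-186),
  row.names  = c("KRT5", "GENEX", "KRT17")
)
# keep markers that are both significant after adjustment and well enriched
sig <- markers[markers$p_val_adj < 0.05 & markers$avg_log2FC > 0.5, ]
rownames(sig)
```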
One of the common methods of determining what clusters are is to look at their top genes compared to other clusters, and to examine only positive markers, meaning genes enriched in the cluster rather than merely absent elsewhere.
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
hbca.markers <- FindAllMarkers(hbca, only.pos = TRUE)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
hbca.markers.top <- hbca.markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 0.8)
dim(hbca.markers)
## [1] 1330 7
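The dplyr filter above keeps every gene over a fold-change cutoff; sometimes you instead want a fixed number of top genes per cluster. A base-R sketch of that idea (gene names, cluster labels, and values are mock data):

```r
# Base-R alternative to the dplyr pipeline above: take the top 2 genes per
# cluster by avg_log2FC. Mock FindAllMarkers()-style table.
mk <- data.frame(
  gene       = paste0("G", 1:6),
  cluster    = rep(c("0", "1"), each = 3),
  avg_log2FC = c(2.0, 0.5, 1.2, 0.9, 3.1, 1.0)
)
top_n_per_cluster <- function(df, n = 2) {
  do.call(rbind, lapply(split(df, df$cluster), function(d) {
    d[order(-d$avg_log2FC), ][seq_len(min(n, nrow(d))), ]
  }))
}
top_n_per_cluster(mk)$gene
```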
There are a number of different methods to compare DE using Seurat. We will discuss some of them after merging, as you usually want to handle them across groups.
Often it is good to have some genes that you expect to be high in specific clusters based on the literature.
Note that the expected gene list might change depending on your cell types, organ, species, and whether you are looking at single cell or single nuclei.
However, it is good to have an idea of what you want to look at first - not so much as a guide, but more as a sanity check.
For example, if you do FACS sorting, e.g. enriching for CD45+ cells, you expect enrichment for various types of white blood cells, particularly T cells and B cells. Therefore it is good to have markers both for what you expect to be present and for what you expect to be removed - here, you would expect fewer endothelial and epithelial genes.
I strongly recommend actually putting this into a tab-delimited sheet and reading it in. That will allow the gene lists to be a little more fluid.
Requires:
readxl - library that lets you read in Excel files. This is a very useful package that allows you to use Excel sheets in addition to csv files.
read_excel - the actual function for reading in Excel files.
# partly hard coded due to path name but inside is not
my_Guided_Markers_path <- paste0(your_working_dir,"/input/Guided_Markers.xlsx")
# read file in
user_markers <- readxl::read_excel(my_Guided_Markers_path)
user_markers <- as.data.frame(user_markers)
# view file
user_markers[1:3,]
## Celltype Description Gene Sources
## 1 Endothelial Endothelial CDH5 Core_List
## 2 Endothelial Endothelial VWF Core_List
## 3 Endothelial Endothelial PECAM1 Core_List
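If you prefer the tab-delimited sheet suggested above, base R can read it without readxl. The temp file below is an illustrative stand-in for a file you might save alongside Guided_Markers.xlsx:

```r
# Reading the same marker sheet as a tab-delimited file with base R.
# The temp file is a stand-in for a real input file (contents illustrative).
tsv <- tempfile(fileext = ".txt")
writeLines(c("Celltype\tDescription\tGene\tSources",
             "Endothelial\tEndothelial\tCDH5\tCore_List",
             "Endothelial\tEndothelial\tVWF\tCore_List"), tsv)
user_markers_tsv <- read.delim(tsv, stringsAsFactors = FALSE)
user_markers_tsv$Gene
```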
However, you can also fully hardcode it like this.
# Non Immune
# Endothelial
Endo_genes <- c("CDH5","VWF","PECAM1","ESAM","THBD","CD36", "CD34", "HEG1")
# Epithelial
Epi_genes <- c("EPCAM","KRT5","KRT17","KRT18","KRT19", "KRT8", "KRT14") #"KRT6"
# Basal
Basal_genes <- c("KRT14","KRT17","KRT5")
# LumHR
LumHR_genes <- c("KRT18","KRT8","KRT19")
# LumSec
LumSec_genes <- c("KRT15","KRT23","KRT16","KRT7")
kartins <- c(Basal_genes, LumHR_genes, LumSec_genes) # keratin markers
# Adipose
Adipo_genes <- c("APMAP","ADIPOQ","ADIPOQ-AS1","TPRA1")
# Fibroblast Genes
Fibro_genes <- c("COL1A1","LUM","FBLN1", "DCN", "FBN1", "COL1A2", "PCOLCE")
non_immune_genes <- c(Endo_genes, Epi_genes, Adipo_genes,Fibro_genes)
#########################################################################
# Immune and lymphocytes
# T-cells
Tcell_genes <- c("CD3D","CD4","CD8A", "TRAC", "TRBC1", "TRBC2", "IL7R", "CD96")
# B-cells
Bcell_genes <- c("CD19","MS4A1","CD79A","CD79B","BANK1", "SEL1L3", "IGHM", "IGLC3")
#NK killer cells
NKill_genes <- c("NKG7","GNLY","CTSW")
# General Immune genes
Immun_genes <- c("PTPRC","SRGN","TYROBP") # double check
lymphocytes_cluster_genes <- c(Tcell_genes, Bcell_genes, NKill_genes,Immun_genes)
#########################################################################
# Myeloid
myeloid_genes <- c("FCER1G", "ITGAM", "FCGR3A") # the CD1c+ myeloid DCs, the CD141+ myeloid DCs and the CD303+
# Red Blood Cells
RedBld_genes <- c("HBA1","HBA2","HBB")
# Macrophages
Macro_genes <- c("CD14","CD68","CD163","C1QA", "C1QB", "MRC1", "MSR1", "MS4A7")
# Monocytes
Monoc_genes <- c("LST1","FCGR3A","AIF1", "MAFB", "MS4A6A", "ANPEP", "LYZ")
# Dendrite genes
Dendr_genes <- c("LILRA4", "IRF7", "CLEC9A", "CD1C", "TCF4")
myeloid_cluster_genes <- c(myeloid_genes,RedBld_genes,Macro_genes, Monoc_genes,Dendr_genes)
#########################################################################
# here are some qc features ones might use.
QC_features <- c("nCount_RNA", "nFeature_RNA","percent.mt", "percent.rb")
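Some of the plots below warn that certain genes (e.g. CD19, LILRA4) are not found in the assay being plotted. A small guard function (a sketch; with a real object you would pass rownames(hbca) or the rownames of a specific assay) keeps only the features that are present:

```r
# Keep only features present in the object, warning about the rest.
# mock_rownames stands in for rownames(hbca) on a real Seurat object.
features_present <- function(genes, available) {
  missing <- setdiff(genes, available)
  if (length(missing) > 0)
    message("Dropping absent features: ", paste(missing, collapse = ", "))
  intersect(genes, available)
}
mock_rownames <- c("CD3D", "CD4", "NKG7", "PTPRC")
features_present(c("CD3D", "CD19", "PTPRC"), mock_rownames)
```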
Often there are a few genes that we like to examine based on canonical markers, or to check the quality of the different clusters.
For example, clusters with a very low number of counts, or with high mito or ribo content, might be removed manually in addition to the cut offs.
Further, clusters containing genes that are never expected to be expressed together are sometimes joined.
Function:
VlnPlot - draws a violin plot of the data, usually based on gene expression, metrics, or scores. It is often grouped by cluster or cell type.
This plot shows the most common features that are used during quality control of single cell RNA data.
VlnPlot(hbca, features = QC_features, assay="RNA", group.by=paste0("SCT_snn_res.",paper_res))
You can do this for other genes as well.
VlnPlot(hbca, features = Endo_genes[1:6], assay="RNA", group.by=paste0("SCT_snn_res.",paper_res))
Function:
FeaturePlot - plots a UMAP but colors each cell by the value of the feature. The color scale and the assay plotted can be changed, among other items.
You can similarly examine genes using the feature plot to look at the QC results.
FeaturePlot(hbca, features = QC_features)
You can examine the genes in the umaps as well.
FeaturePlot(hbca, features = Endo_genes[1:4])
FeaturePlot(hbca, features = Immun_genes[1:4])
## Warning in FetchData.Seurat(object = object, vars = c(dims, "ident", features),
## : The following requested variables were not found: NA
FeaturePlot(hbca, features = Fibro_genes[1:4])
This is very useful when examining two genes where you are interested in the interaction, or when you want to look at a focal group.
FeaturePlot(hbca, features = c("DCN", "FBLN1"), blend = TRUE)
These plots come from the ggridges plot feature. It allows us to visualize the single cell expression distribution in each cluster.
Function:
RidgePlot - shows the expression of genes across clusters. This is not as smoothed out as a violin plot.
You can examine the same marker genes this way as well.
RidgePlot(hbca, features = Endo_genes[1:4], ncol = 2)
## Picking joint bandwidth of 0.0163
## Picking joint bandwidth of 0.0166
## Picking joint bandwidth of 0.0223
## Picking joint bandwidth of 0.0145
RidgePlot(hbca, features = Fibro_genes[1:4], ncol = 2)
## Picking joint bandwidth of 0.045
## Picking joint bandwidth of 0.029
## Picking joint bandwidth of 0.0281
## Picking joint bandwidth of 0.0619
RidgePlot(hbca, features = Immun_genes[1:4], ncol = 2)
## Warning in FetchData.Seurat(object = object, vars = features, slot = slot): The
## following requested variables were not found: NA
## Picking joint bandwidth of 0.0634
## Picking joint bandwidth of 0.125
## Picking joint bandwidth of 0.0349
While it is good to visualize these items side by side, sometimes it is good to examine them across all clusters at once. Dot plots are a good way to do that. Dot plots often allow you to examine a larger number of genes across clusters and, if doing manual naming, can assist. However, manual naming is not always the best.
Function: 1.
DotPlot - shows, for each cluster, the proportion of cells expressing each feature and its average expression.
DotPlot(hbca, features = non_immune_genes) + RotatedAxis()
DotPlot(hbca, features = lymphocytes_cluster_genes) + RotatedAxis()
## Warning: Could not find MS4A1 in the default search locations, found in RNA
## assay instead
## Warning: Could not find IGLC3 in the default search locations, found in RNA
## assay instead
## Warning in FetchData.Seurat(object = object, vars = features, cells = cells):
## The following requested variables were not found: CD19
DotPlot(hbca, features = unique(myeloid_cluster_genes)) + RotatedAxis()
## Warning: Could not find CLEC9A in the default search locations, found in RNA
## assay instead
## Warning: Could not find CD1C in the default search locations, found in RNA
## assay instead
## Warning in FetchData.Seurat(object = object, vars = features, cells = cells):
## The following requested variables were not found: LILRA4
DoHeatmap(subset(hbca, downsample = 100), features = non_immune_genes, size = 3)
## Warning in DoHeatmap(subset(hbca, downsample = 100), features =
## non_immune_genes, : The following features were omitted as they were not found
## in the scale.data slot for the SCT assay: TPRA1, ADIPOQ-AS1, ADIPOQ, APMAP
DoHeatmap(subset(hbca, downsample = 100), features = lymphocytes_cluster_genes, size = 3)
## Warning in DoHeatmap(subset(hbca, downsample = 100), features =
## lymphocytes_cluster_genes, : The following features were omitted as they were
## not found in the scale.data slot for the SCT assay: CTSW, IGLC3, SEL1L3, BANK1,
## CD79B, MS4A1, CD19, CD4
You can change the slot you are pulling from.
DoHeatmap(subset(hbca, downsample = 100), features = myeloid_cluster_genes, size = 3)
## Warning in DoHeatmap(subset(hbca, downsample = 100), features =
## myeloid_cluster_genes, : The following features were omitted as they were not
## found in the scale.data slot for the SCT assay: CD1C, CLEC9A, IRF7, LILRA4,
## ANPEP, CD14, ITGAM
You can manually assign cluster names using this feature, but some of the labels are more difficult than others.
For example, just based on what we have looked at, you might start the preliminary labeling like this.
However, this is likely not enough. There are items you might want to remove or change, and some cell types will not show up when you have so few cells. This is why this labeling step is often left until later; you can get started, but these might not be the best labels to start with.
new.cluster.ids <- c("Fibroblast_0", "Perivascular_1", "Underclustered_2",
"Endothelial_3", "Lymphocytes_4",
"Lum_Epithelial_5", "Basal_Epithelial_6",
"Myeloid_7")
names(new.cluster.ids) <- levels(hbca)
hbca <- RenameIdents(hbca, new.cluster.ids)
DimPlot(hbca, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
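RenameIdents() is essentially a mapping from each current identity level to a new label. The same idea with a plain factor (mock cluster vector) makes the mechanics clear:

```r
# RenameIdents() maps each current identity level to a new label; the same
# mapping with a plain factor (mock cluster assignments):
clusters <- factor(c("0", "1", "0", "2"))
new.ids <- c("Fibroblast_0", "Perivascular_1", "Underclustered_2")
names(new.ids) <- levels(clusters)            # old level -> new label
relabeled <- factor(new.ids[as.character(clusters)], levels = new.ids)
as.character(relabeled)
```

If you want the labels to survive later identity changes, you might also copy them into the metadata, e.g. hbca$cell_label <- Idents(hbca).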
Note I named one of the clusters as underclustered. Why?
Note I named one of the clusters as Perivascular. Why might that be wrong?
When naming, we need to consider the different lists of the genes selected. For example we can look at the following gene lists using dot plots.
DotPlot(hbca, features = Epi_genes) + RotatedAxis()
#DotPlot(hbca, features = unique(kartins)) + RotatedAxis()
#DotPlot(hbca, features = Fibro_genes) + RotatedAxis()
#DotPlot(hbca, features = Tcell_genes) + RotatedAxis()
#DotPlot(hbca, features = Bcell_genes) + RotatedAxis()
#DotPlot(hbca, features = NKill_genes) + RotatedAxis()
#DotPlot(hbca, features = Immun_genes) + RotatedAxis()
#DotPlot(hbca, features = myeloid_genes) + RotatedAxis()
#DotPlot(hbca, features = non_immune_genes) + RotatedAxis()
#DotPlot(hbca, features = lymphocytes_cluster_genes) + RotatedAxis()
Or we could use the marker information we provided in the user file.
DotPlot(hbca, features = unique(user_markers$Gene[user_markers$Celltype=="Macrophages"]), assay = "RNA") + RotatedAxis()
Or we could examine the same information using the heatmaps and using the top marker genes found earlier.
DoHeatmap(subset(hbca, downsample = 100), features = hbca.markers.top$gene, size = 3)
## Warning in DoHeatmap(subset(hbca, downsample = 100), features =
## hbca.markers.top$gene, : The following features were omitted as they were not
## found in the scale.data slot for the SCT assay: H3F3A
Or you can compare them based on a single cluster against the other clusters, such as the Perivascular cluster (cluster 1) against the other clusters.
cluster1.markers <- FindMarkers(hbca, ident.1 = 1, group.by = "seurat_clusters")
head(cluster1.markers)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## MYL9 7.622192e-285 1.0854276 0.812 0.136 1.392117e-280
## NOTCH3 1.640208e-278 0.6563979 0.606 0.031 2.995675e-274
## TAGLN 2.012137e-266 1.1953884 0.870 0.225 3.674968e-262
## ACTA2 1.430734e-245 0.9625597 0.646 0.067 2.613092e-241
## PLN 1.604620e-224 0.6259994 0.444 0.010 2.930678e-220
## LMOD1 3.555372e-216 0.5695649 0.500 0.029 6.493532e-212
# selected the ones with fold change greater than 0.8
cluster1.markers.top <- cluster1.markers %>%
dplyr::filter(avg_log2FC > 0.8)
DoHeatmap(subset(hbca, downsample = 100), features = rownames(cluster1.markers.top), size = 3)
Question: 1. What other names might be appropriate for these clusters? 2. How and why would you name this sample differently if you were using a different resolution? Specifically the low resolution?
You can save this individual output using RDS. This saves the specific object specified. However, be careful when saving data: you need to know how the data object was processed. It is best practice to have the output of one script flow into the next, but you also need to know what was done, which means keeping track of all processing steps. Often one of the best ways to do this is to do all your coding in items like markdown reports, or to use extensive comments.
Functions: 1. saveRDS - saves an object. This does not preserve the name of the object; you need to use readRDS to load the object into a new R session. 2. .Platform$file.sep - gives the correct file separator so you don't have to worry about the platform.
dir.create("output")
file_separator <- .Platform$file.sep
saveRDS(hbca, file=paste0("output", file_separator, sample_name, "_basic_scRNA.RDS"))
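An alternative sketch using file.path(), which inserts the separator for you; sample_name here is a placeholder, since in the tutorial it was defined when the object was created:

```r
# file.path() builds the path with the correct separator directly, so
# .Platform$file.sep is not needed; dir.create() with showWarnings = FALSE
# is safe to re-run. sample_name is a placeholder for this sketch.
sample_name <- "HBCA_example"
out_dir <- "output"
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, paste0(sample_name, "_basic_scRNA.RDS"))
out_file
```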
For reproducible research, one item to keep in mind is always knowing what packages you used, so someone else can reproduce the analysis. Therefore, a good habit is to always output this at the end of your sessions so you know what you did and what you had.
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] readxl_1.4.3 dplyr_1.1.3 patchwork_1.1.3 ggplot2_3.4.4
## [5] sp_1.5-0 SeuratObject_4.1.2 Seurat_4.2.0
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.16 colorspace_2.1-0 deldir_1.0-6
## [4] ellipsis_0.3.2 ggridges_0.5.4 rstudioapi_0.14
## [7] spatstat.data_2.2-0 farver_2.1.1 leiden_0.4.3
## [10] listenv_0.8.0 bit64_4.0.5 ggrepel_0.9.1
## [13] fansi_1.0.5 codetools_0.2-18 splines_4.2.0
## [16] cachem_1.0.8 knitr_1.44 polyclip_1.10-0
## [19] jsonlite_1.8.7 ica_1.0-3 cluster_2.1.3
## [22] png_0.1-8 rgeos_0.5-9 uwot_0.1.16
## [25] shiny_1.7.5.1 sctransform_0.3.5 spatstat.sparse_2.1-1
## [28] compiler_4.2.0 httr_1.4.7 Matrix_1.5-1
## [31] fastmap_1.1.1 lazyeval_0.2.2 cli_3.6.1
## [34] later_1.3.1 htmltools_0.5.6.1 tools_4.2.0
## [37] igraph_1.5.1 gtable_0.3.4 glue_1.6.2
## [40] RANN_2.6.1 reshape2_1.4.4 Rcpp_1.0.11
## [43] scattermore_0.8 cellranger_1.1.0 jquerylib_0.1.4
## [46] vctrs_0.6.4 nlme_3.1-158 progressr_0.11.0
## [49] lmtest_0.9-40 spatstat.random_2.2-0 xfun_0.40
## [52] stringr_1.5.0 globals_0.16.1 mime_0.12
## [55] miniUI_0.1.1.1 lifecycle_1.0.3 irlba_2.3.5.1
## [58] goftest_1.2-3 future_1.28.0 MASS_7.3-57
## [61] zoo_1.8-11 scales_1.2.1 spatstat.core_2.4-4
## [64] promises_1.2.1 spatstat.utils_2.3-1 parallel_4.2.0
## [67] RColorBrewer_1.1-3 yaml_2.3.7 reticulate_1.26
## [70] pbapply_1.5-0 gridExtra_2.3 sass_0.4.7
## [73] rpart_4.1.16 stringi_1.7.12 rlang_1.1.1
## [76] pkgconfig_2.0.3 matrixStats_1.0.0 evaluate_0.22
## [79] lattice_0.20-45 ROCR_1.0-11 purrr_1.0.2
## [82] tensor_1.5 labeling_0.4.3 htmlwidgets_1.6.2
## [85] bit_4.0.4 cowplot_1.1.1 tidyselect_1.2.0
## [88] parallelly_1.32.1 RcppAnnoy_0.0.21 plyr_1.8.7
## [91] magrittr_2.0.3 R6_2.5.1 generics_0.1.3
## [94] DBI_1.1.3 withr_2.5.1 mgcv_1.8-40
## [97] pillar_1.9.0 fitdistrplus_1.1-8 survival_3.3-1
## [100] abind_1.4-5 tibble_3.2.1 future.apply_1.9.1
## [103] hdf5r_1.3.6 KernSmooth_2.23-20 utf8_1.2.3
## [106] spatstat.geom_2.4-0 plotly_4.10.2 rmarkdown_2.25
## [109] grid_4.2.0 data.table_1.14.8 digest_0.6.33
## [112] xtable_1.8-4 tidyr_1.3.0 httpuv_1.6.11
## [115] munsell_0.5.0 viridisLite_0.4.2 bslib_0.5.1